QTM 447 Lecture 13: Using CNNs

Kevin McAlister

February 25, 2025

CNNs

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

Images are a structured input that is difficult for many machine learning methods to handle

  • Each colored instance is a \(3 \times H \times W\) tensor input

  • Location matters! Images are all about spatial context - a cat is a cat regardless of which way it is facing!

  • A lot of “features” per instance - a \(3 \times 32 \times 32\) image has 3072 pixel values!

CNNs

The solution: Convolution Layers

  • Convolve the image with a filter of size \(C \times F \times F\) with learned parameters

  • Each filter learns some part of the image (edges, innards, etc.)

With enough filters in a layer, break an image down into its constituent parts!
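In PyTorch, a convolution layer like this is a single `nn.Conv2d` call. A minimal sketch (the filter count and size here are illustrative choices, not from the course notebook):

```python
import torch
import torch.nn as nn

# 64 learned filters, each of size 3 x 7 x 7; padding of 3 px for a 7 x 7
# kernel at stride 1 keeps the 32 x 32 spatial size ("same" convolution).
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7,
                 stride=1, padding=3)

x = torch.randn(1, 3, 32, 32)   # a batch with one 3 x 32 x 32 image
out = conv(x)
print(out.shape)                # torch.Size([1, 64, 32, 32])
```

Each of the 64 output channels is one filter's feature map over the image.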

CNNs

CIFAR-10 Data Set:

  • 50,000 instances of \(32 \times 32 \times 3\) RGB images

  • A tougher image task than digits

CNNs

The CNN structure (AlexNet alike):

| Layer    | In Ch. | In H/W | Filters | FSize | Stride | Pad  | Out Ch. | Out H/W |
|----------|--------|--------|---------|-------|--------|------|---------|---------|
| Conv1    | 3      | 32     | 64      | 7     | 1      | Same | 64      | 32      |
| MaxPool1 | 64     | 32     | -       | 2     | 2      | 0    | 64      | 16      |
| Conv2    | 64     | 16     | 128     | 5     | 1      | Same | 128     | 16      |
| MaxPool2 | 128    | 16     | -       | 2     | 2      | 0    | 128     | 8       |
| Conv3    | 128    | 8      | 256     | 3     | 1      | Same | 256     | 8       |
| MaxPool3 | 256    | 8      | -       | 2     | 2      | 0    | 256     | 4       |
| Flatten  | 256    | 4      | -       | -     | -      | -    | 4096    | -       |
| FNN4     | 4096   | -      | -       | -     | -      | -    | 512     | -       |
| FNN5     | 512    | -      | -       | -     | -      | -    | 512     | -       |
| FNN6     | 512    | -      | -       | -     | -      | -    | 10      | -       |

Flattening the final \(256 \times 4 \times 4\) feature maps gives \(4096\) features.
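The table translates almost line-for-line into an `nn.Sequential`. A hedged sketch (the activation choices are mine; flattening \(256 \times 4 \times 4\) gives 4096 features):

```python
import torch
import torch.nn as nn

# AlexNet-like stack: three conv/pool stages, then a three-layer
# fully connected head for the 10 CIFAR-10 classes.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=1, padding="same"), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(64, 128, kernel_size=5, stride=1, padding="same"), nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding="same"), nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Flatten(),                          # 256 x 4 x 4 = 4096 features
    nn.Linear(256 * 4 * 4, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

x = torch.randn(8, 3, 32, 32)
print(model(x).shape)                      # torch.Size([8, 10])
```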

CNNs

We start with a rich image.

  • In the first convolution layer, we create feature maps for different aspects of the images

  • The pooling layers reduce the dimensionality of those feature maps (too much redundancy/a lot of white space due to ReLU)

  • Second layer convolves in and across the feature maps (further split up the features)

  • Pool/Convolve/Pool

  • Flatten and then do a 3 layer NN for the values!

CNNs

Each successive convolution/pooling layer downsamples the feature map!

  • As the maps get more granular to specific features, we need less resolution since we’re only localizing to neighborhoods of the images

Make up for downsample with more filters!

  • More or less preserve the total volume of the original images!

CNNs

Before we show off this structure, one more “layer” - normalization layers

Deep CNNs tend to have problems with vanishing gradients

  • Each filter is trying to learn a small part of the image.

  • The gradients get really small when the filters get specific

CNNs

A solution that works (without much of a theoretical basis as to why it works) is to use batch normalization

At any step, we have a 4D tensor of feature maps:

\[\underset{\text{\# of training instances}}{N} \times \underset{\text{\# of channels}}{C} \times \underset{\text{height in px}}{H} \times \underset{\text{width in px}}{W}\]

For a single channel:

\[\underset{\text{\# of training instances}}{N} \times \underset{\text{height in px}}{H} \times \underset{\text{width in px}}{W}\]

CNNs

Goal: Normalize the output for each channel so that they have zero mean and unit variance!

Why?

  • Improves optimization.

Why?

  • It just does.

CNNs

For a single channel:

\[\underset{\text{\# of training instances}}{N} \times \underset{\text{height in px}}{H} \times \underset{\text{width in px}}{W}\]

\[\mu_c = \frac{1}{N \times H \times W} \sum \limits_{i,j,k} x_{i,j,k}\]

\[\sigma^2_c = \frac{1}{N \times H \times W} \sum \limits_{i,j,k} (x_{i,j,k} - \mu_c)^2\]

\[\hat{x}_{i,j,k} = \frac{x_{i,j,k} - \mu_c}{\sqrt{\sigma^2_c + \epsilon}}\]

\[y_{i,j,k} = \gamma_c \hat{x}_{i,j,k} + \delta_c\]

  • \(\gamma_c\) and \(\delta_c\) are shift and scale parameters that are learned for each channel via backprop
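The per-channel equations above can be checked directly; a minimal sketch with \(\gamma_c = 1\) and \(\delta_c = 0\) (in a real layer these are learned):

```python
import torch

# Manual batch norm over one channel's N x H x W values, matching the
# formulas above: per-channel mean, population variance, then normalize.
x = torch.randn(16, 8, 8)            # N x H x W values for a single channel c
eps = 1e-5
mu_c = x.mean()
var_c = x.var(unbiased=False)        # population variance, as in the formula
x_hat = (x - mu_c) / torch.sqrt(var_c + eps)

gamma_c, delta_c = 1.0, 0.0          # learned per channel via backprop
y = gamma_c * x_hat + delta_c
print(y.mean().item(), y.var(unbiased=False).item())  # ~0 and ~1
```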

CNNs

A normalization layer is typically placed after a convolution layer

  • We can also normalize after any fully connected layers (it also helps there!)

Requires a slightly different operation at test time

  • Outlined in chapter 14 of PML1

  • Handled natively by PyTorch
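The typical Conv → BatchNorm → ReLU placement, with PyTorch handling the train/test distinction; layer sizes here are my own choices:

```python
import torch
import torch.nn as nn

# nn.BatchNorm2d holds one (gamma, delta) pair per channel and tracks
# running statistics for use at test time.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

block.train()                        # uses batch statistics
_ = block(torch.randn(8, 3, 32, 32))

block.eval()                         # uses running mean/var from training
out = block(torch.randn(1, 3, 32, 32))
print(out.shape)                     # torch.Size([1, 64, 32, 32])
```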

CNNs

Let’s try all this out for the CIFAR10 data!

CNNs

AlexNet gets us to around 75% in validation accuracy!

  • For a 10 class problem, this is pretty good!

CNNs

One thing that we might want to do is visualize what each filter corresponds to in the original images

  • Or try, at least

This can be really tough when we have multiple convolutional layers!

  • The bottom layers correspond to little pieces of the images

  • Higher layers put the little pieces together into bigger pieces

CNNs

Approach 1: Exemplars

At any layer, compute the hidden values for image \(i\) after using filter \(j\)

Find the images with the largest values at that filter!

  • Large values correspond to high “weight” at a filter

Look at the top x images and find commonalities to see what we can see.
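The exemplar approach is a few lines of indexing; a sketch with a stand-in network and random "images" (the layer and filter choices are mine):

```python
import torch
import torch.nn as nn

# Score each image by its largest activation at filter j of a conv layer,
# then keep the indices of the top-x highest-scoring images.
conv = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
images = torch.randn(100, 3, 32, 32)         # stand-in training images

j, top_x = 5, 8
with torch.no_grad():
    fmap = conv(images)[:, j]                # N x H x W map for filter j
    scores = fmap.amax(dim=(1, 2))           # largest activation per image
top_scores, top_idx = scores.topk(top_x)     # exemplar image indices
print(top_idx.shape)                         # torch.Size([8])
```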

CNNs

Approach 2: Activation Maximization

At the bottom of the network, input is a \(C \times H \times W\) tensor of pixel values

  • Start with a random set of pixels

  • Feed this through the trained network until we get to the hidden representations at filter \(j\)

  • Compute the gradient at the filter for an input image

  • Ascend the gradient to increase the activation value!

  • Repeat over and over again.
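The steps above can be sketched as a gradient-ascent loop. The network here is an untrained stand-in and the learning rate is my choice, so this only illustrates the mechanics:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Activation maximization: ascend the gradient of filter j's mean
# activation with respect to the input pixels themselves.
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
j, lr = 3, 0.1

x = torch.randn(1, 3, 32, 32, requires_grad=True)  # random starting pixels
start = net(x)[0, j].mean().item()

for _ in range(50):
    act = net(x)[0, j].mean()        # activation at filter j
    act.backward()                   # gradient w.r.t. the pixels
    with torch.no_grad():
        x += lr * x.grad             # ascend the gradient
        x.grad.zero_()

print(net(x)[0, j].mean().item())    # should exceed `start` after ascent
```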

CNNs

Approach 2: Activation Maximization

In theory, this approach yields some visual representation of what image would activate most heavily at a specific filter.

  • Eventually.

Somewhat costly in practice

  • A lot of evaluations needed!

CNNs

AlexNet does well on this data set

  • But we can do better

We can improve this by using a deeper architecture

CNNs

CNN Architecture (VGG) Rules:

  • All convolutions are same 3x3 convolutions with stride 1

  • All max pools are 2x2 with stride 2

  • After pooling, double the number of filters/channels
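The three rules above can be sketched as a reusable stage builder (the number of convs per stage is my choice):

```python
import torch
import torch.nn as nn

# VGG-style stage: "same" 3x3 convs at stride 1, then a 2x2 stride-2
# max pool; channel counts double from one stage to the next.
def vgg_stage(c_in, c_out, n_convs=2):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                   nn.ReLU()]
    layers.append(nn.MaxPool2d(2, 2))
    return nn.Sequential(*layers)

stages = nn.Sequential(vgg_stage(3, 64), vgg_stage(64, 128), vgg_stage(128, 256))
x = torch.randn(1, 3, 32, 32)
print(stages(x).shape)              # torch.Size([1, 256, 4, 4])
```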

Let’s go back over to the notebook to see what this looks like.

CNNs

A little better, but still not state of the art!

The problem is that our networks are quickly overfitting to the data

  • For CNNs, we need to force our models to generalize better by changing up the training procedure.

Two dominant approaches to get better generalization for CNNs:

  • Dropout

  • Data Augmentation

CNNs

As we saw before, dropout is the act of randomly turning off connections in the network with some probability.

  • The idea is that it forces the procedure to learn an ensemble of bad networks that work pretty well once averaged together

  • Reduce reliance on any one convolution or neuron

Let’s apply this to our VGG model to see if it works a little better

  • Aggressive dropout to force a generalizable model!
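Dropout in the fully connected head looks like this; `p = 0.5` is the aggressive rate I'm assuming here:

```python
import torch
import torch.nn as nn

# Dropout randomly zeroes connections during training; PyTorch disables
# it automatically in eval mode.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 4 * 4, 512), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(512, 10),
)

head.train()                         # dropout active during training
_ = head(torch.randn(8, 256, 4, 4))
head.eval()                          # dropout off at test time
out = head(torch.randn(8, 256, 4, 4))
print(out.shape)                     # torch.Size([8, 10])
```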

CNNs

A clever image-specific approach is data augmentation

CNNs

A cat is a cat is a cat

  • Even if it’s flipped over

  • Even if it’s closer to the camera

  • Even if it is facing left or right

Data Augmentation supplements the training set with new images that are random variations on the original images!

  • The idea is that the new variations improve the ability of the model to handle different orientations of cats/dogs/airplanes/etc.

CNNs

Pretty easy to implement in PyTorch with Torchvision

CNNs

Data Augmentation slows down the training procedure!

  • More iterations needed to get to the minimum because the training data is constantly changing

  • The benefit is that the model becomes quite robust to small changes in images and can do a pretty good job of generalizing to new pictures!

We eke out more performance.

  • Could probably be a little better if we ran it longer, but not much better.

ResNets

Looking at this data, we didn’t see the improvement that we would’ve hoped to see by making the model deeper

  • AlexNet to VGG wasn’t a huge increase in power. It gave us about 10% more accuracy.

  • Why can’t we get close to 100?

This is a common phenomenon in CNNs

  • The more layers we add, vanilla CNNs don’t seem to improve all that much…

ResNets

This doesn’t jibe with what we understand about DNNs

  • As the DNN gets deeper, we should see pretty good improvement in novel image problems

  • In fact, there is a common phenomenon where really deep CNNs actually perform worse than shallow CNNs

  • This shouldn’t happen because a shallow network with a new layer should, at worst, do as well as the shallow model!

ResNets

The problem is one of optimization rather than theory!

  • The gradient disappears really quickly in deep CNNs

A novel solution to this problem is the residual network (He et al, 2016)

ResNets

The basis of the ResNet is the residual block

ResNets

We’re adding the new info learned via the convolution layers within the block to the original input

  • Huh?

This works because it is asking each residual block to learn some small part of the overall mapping of the original features to the outcome

  • \(y = \mathbf x + g(\alpha) + h(\beta) + \cdots\)

With enough little parts, we’re able to complete the mapping!
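A minimal residual block sketch, assuming the common Conv → BN → ReLU layout (the exact block design varies across ResNet variants):

```python
import torch
import torch.nn as nn

# The convolutions learn F(x); the skip connection adds the input back,
# so the block outputs x + F(x).
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)        # add the new info F(x) to the input

block = ResidualBlock(64)
x = torch.randn(1, 64, 8, 8)
print(block(x).shape)                    # torch.Size([1, 64, 8, 8])
```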

ResNets

This architecture makes the gradients much less likely to disappear

  • The gradients flow directly from the output to the earlier layers

  • Since each block is learning its own little bit of the mapping of \(\mathbf x\) to \(y\), the gradient for each residual block doesn’t depend on the depth

  • In other words, the gradients in each layer aren’t stacking and multiplying - we’re just adding!

ResNets

A Vanilla CNN:

\[ x_k = f_k \circ f_{k-1} \circ \cdots \circ f_1(x) \]

After chain rule:

\[ \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial x_k} \cdot \prod_{i=1}^{k} \left( \phi_i'(z_i) \, W_i \right) \]

If the spectral norm of each \(\phi_i'(z_i) \, W_i\) is less than 1, then as \(k\) increases the product can shrink exponentially. This leads to the vanishing gradient problem.

ResNets

ResNets:

\[ H(x) = F(x) + x \]

The gradient:

\[ \frac{\partial H(x)}{\partial x} = \frac{\partial \left( F(x) + x \right)}{\partial x} = \frac{\partial F(x)}{\partial x} + \mathcal I \]

ResNets

After chain rule:

\[ \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial H(x)} \cdot \left( \frac{\partial F(x)}{\partial x} + \mathcal I \right) \]

With many layers:

\[ \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial x_k} \cdot \prod_{i=1}^{k} \left( \mathcal I + J_{F_i}(x_i) \right) \]

ResNets

Even if \(J_{F_i} (x_i)\) has eigenvalues less than 1, the added identity ensures that each factor \(\mathcal I + J_{F_i}(x_i)\) stays close to the identity matrix, so the product does not shrink to zero.

  • This allows the gradient to propagate more directly and mitigates the vanishing gradient problem.

Where have we seen this idea before?

ResNets

Architecture:

  • Aggressive stem with a \(f \times f\) filter and pooling stage that cuts the initial image size by 75%; 64 filters

  • 3 residual blocks with 64 filters

  • 3 residual blocks with 128 filters

  • 3 residual blocks with 256 filters

  • 3 residual blocks with 512 filters

  • Global average pooling and a single multinomial layer

ResNets

ResNets also leverage some concepts popularized by Google in 2015 with GoogLeNet (an homage to LeNet)

No fully connected hidden layers at the top of the network!

  • Instead, the values in the feature map for each channel are averaged after the last convolution layer - this is called global average pooling

  • The idea is that the series of residual blocks has made the feature maps at the top layer so sparse that each image, more or less, corresponds to a few of the feature maps. All we need to know is which ones to predict what it is!
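Global average pooling replaces the fully connected head; a sketch assuming 512 channels after the last residual block:

```python
import torch
import torch.nn as nn

# Average each channel's feature map down to one number, then feed the
# resulting C features straight into a single output layer.
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),      # C x H x W -> C x 1 x 1 per image
    nn.Flatten(),                 # -> C features
    nn.Linear(512, 10),           # single multinomial layer, no FC hidden layers
)

fmap = torch.randn(4, 512, 4, 4)  # feature maps after the last conv stage
print(gap_head(fmap).shape)       # torch.Size([4, 10])
```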

ResNets

This architecture, more or less, matches the performance of the VGG model without residual connections.

  • Any guess as to why?

  • Why do Residual Connections in the first place?

ResNets

Even with my sick GPU, I can’t really train a model with 50 layers!

  • What to do?

CNN Architectures

Where do these architectural choices come from?

The ImageNet Challenge!

ImageNet is a large database of more than 14 million hand-annotated images

  • What objects are in the images?

  • For multi-object images, what is the main object?

  • More than 20,000 categories!

CNN Architectures

CNN Architectures

ImageNet Challenge:

  • For a standardized subset of ImageNet, predict the top-5 categories (in terms of softmax) for each image

  • If the true label for a test image is in the top 5, success!

CNN Architectures

CNN Architectures

This is a really hard problem!

  • The training set is really big - millions of images

  • The challenge set has 1,000 classes

Modern CNN architectures require serious compute power!

  • Millions and millions of parameters (still way less than FCNNs)

These models perform really well on this tough challenge!

CNN Architectures

CNN Architectures

You could replicate these results for the ImageNet challenge yourself!

  • If you had millions of dollars and infinite compute resources

These architectures really push the limits of what computers can do

  • Often multiple GPUs running for days

CNN Architectures

ImageNet is a broad database with lots of different tagged images

  • Complicated models with lots of convolution layers create a large set of feature extractors

  • Take in the image, break it down to its constituent parts, use these parts to create class predictions

A thought: an image of a cat is an image of a cat no matter how it was taken

An image of a cat that was not seen in the training data is probably just some function of the different feature extractors learned by the CNN!

  • A mouse is probably just some combination of a cat and a rat - a smaller furrier rat

CNN Architectures

A team with millions of dollars has already trained a high performance feature extractor on ImageNet

  • Broad base

  • Performs well for a lot of different images

Is it possible for us to use that feature extractor to avoid needing to train the feature extractors (e.g. convolution layers) ourselves?

Transfer Learning

For image analysis, transfer learning is the norm

Transfer Learning

Transfer Learning

Transfer learning works!

  • The bottom layers learn to extract edges of images

  • As we move up, we learn to combine edges and innards to get small blocks of images

  • The process is pretty subject agnostic - an image is just a collection of lines and colors!

No need to keep retraining a method to extract lines and colors in patches from a dense image!

Transfer Learning

Process:

  • Take pretrained weights from a model

  • Pass input images for your task through this deep feature extractor

  • For each image, record the feature map at the last step

  • Use this feature map to train a FCNN that will categorize your data!

We’re just using a well-tuned feature extractor!

  • Just like using an agreed upon model to pre-process the images

Transfer Learning

Transfer Learning

Let’s look at using a pre-trained ResNet

  • See notebook

A few notes about transfer:

  • This is still very memory intensive. We don’t have to backprop through the convolutional layers, but we do have to compute the feature maps for all of the images! This is costly and requires a lot of resources.

  • Augmenting your data can help with the training. However, the pre-trained model will have used augmentation, dropout, etc. during training. It will eke out a little more performance!

Transfer Learning

Transfer learning is the norm!

  • Images are images are images (just like songs are just pieces of other songs)

  • SoTA CNNs are learning how to break images into parts. The feature extractor just does this in a really clever way.

  • It is almost always a waste of time to not use a pre-trained extractor! If your image set is even a little similar to ImageNet (almost all are), you’ll do better using a pre-trained feature extractor.

Transfer Learning

Transfer learning opens up a world of state of the art image analysis techniques

  • Image segmentation

  • Object detection

  • Semantic segmentation

We’ll talk about these in our next class!